Long-term Leap Attention, Short-term Periodic Shift for Video Classification
A video transformer naturally incurs a heavier computation burden than a static
vision transformer, as the former processes a much longer sequence than the
latter under attention of quadratic complexity. Existing works treat the
temporal axis as a simple extension of the spatial axes, focusing on shortening
the spatio-temporal sequence by either generic pooling or local windowing,
without exploiting temporal redundancy.
However, videos naturally contain redundant information between neighboring
frames; thereby, we could potentially suppress attention on visually similar
frames in a dilated manner. Based on this hypothesis, we propose LAPS, a
long-term "Leap Attention" (LA) and short-term "Periodic Shift" (P-Shift)
module for video transformers, with reduced complexity. Specifically, the LA
groups long-term frames into pairs, then refactors each discrete pair via
attention. The P-Shift exchanges features between temporal neighbors to
counter the loss of short-term dynamics. By replacing a vanilla 2D attention
with the LAPS, we can adapt a static transformer into a video one, with zero
extra parameters and negligible computation overhead (2.6%).
Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS
transformer could achieve competitive performances in terms of accuracy, FLOPs,
and Params among CNN and transformer SOTAs. We open-source our project at
https://github.com/VideoNetworks/LAPS-transformer.
Comment: Accepted by ACM Multimedia 2022, 10 pages, 4 figures
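The short-term "P-Shift" above is a zero-parameter operation: a slice of
channels is exchanged with temporal neighbors. A minimal sketch under that
reading (the channel-split pattern and circular wrap are assumptions, not the
paper's exact design):

```python
import numpy as np

def periodic_shift(x, shift_div=4):
    """Zero-parameter temporal shift: a fraction of channels is rolled to the
    previous/next frame so each frame mixes features with its neighbors.
    x: array of shape (T, C) -- T frames, C channels per frame."""
    t, c = x.shape
    fold = c // shift_div
    out = x.copy()
    out[:, :fold] = np.roll(x[:, :fold], 1, axis=0)          # pull from previous frame
    out[:, fold:2 * fold] = np.roll(x[:, fold:2 * fold], -1, axis=0)  # pull from next frame
    return out

frames = np.arange(12, dtype=float).reshape(3, 4)  # 3 frames, 4 channels
shifted = periodic_shift(frames, shift_div=4)
```

Because the shift only reindexes existing features, it adds no parameters and
essentially no FLOPs, which is consistent with the overhead figure quoted
above.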
Re-Attention Transformer for Weakly Supervised Object Localization
Weakly supervised object localization is a challenging task which aims to
localize objects with coarse annotations such as image categories. Existing
deep network approaches are mainly based on the class activation map, which
focuses on highlighting discriminative local regions while ignoring the full
object. In addition, emerging transformer-based techniques tend to place
excessive emphasis on the background, which impedes the ability to identify
complete objects.
To address these issues, we present a re-attention mechanism termed token
refinement transformer (TRT) that captures the object-level semantics to guide
the localization well. Specifically, TRT introduces a novel module named token
priority scoring module (TPSM) to suppress the effects of background noise
while focusing on the target object. Then, we incorporate the class activation
map as the semantically aware input to restrain the attention map to the target
object. Extensive experiments on two benchmarks showcase the superiority of our
proposed method against existing methods with image category annotations.
Source code is available at
https://github.com/su-hui-zz/ReAttentionTransformer.
Comment: 11 pages, 5 figures
NLPBench: Evaluating Large Language Models on Solving NLP Problems
Recent developments in large language models (LLMs) have shown promise in
enhancing the capabilities of natural language processing (NLP). Despite these
successes, there remains a dearth of research dedicated to the NLP
problem-solving abilities of LLMs. To fill the gap in this area, we present a
unique benchmarking dataset, NLPBench, comprising 378 college-level NLP
questions spanning various NLP topics sourced from Yale University's prior
final exams. NLPBench includes questions with context, in which multiple
sub-questions share the same public information, and diverse question types,
including multiple choice, short answer, and math. Our evaluation, centered on
LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting
strategies like the chain-of-thought (CoT) and tree-of-thought (ToT). Our study
reveals that the effectiveness of the advanced prompting strategies can be
inconsistent, occasionally damaging LLM performance, especially in smaller
models like the LLAMA-2 (13b). Furthermore, our manual assessment illuminated
specific shortcomings in LLMs' scientific problem-solving skills, with
weaknesses in logical decomposition and reasoning notably affecting results.
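The prompting strategies compared above can be illustrated with a toy prompt
builder; the function name and cue wording are hypothetical, not NLPBench's
actual harness:

```python
def build_prompt(question, context=None, strategy="cot"):
    """Assemble an exam-style prompt; 'cot' appends a chain-of-thought cue.
    Illustrative sketch of the evaluation setup described in the abstract."""
    parts = []
    if context:
        # Questions with context share the same public information
        # across multiple sub-questions.
        parts.append(f"Shared context:\n{context}")
    parts.append(f"Question:\n{question}")
    if strategy == "cot":
        parts.append("Let's think step by step before giving the final answer.")
    return "\n\n".join(parts)

p = build_prompt("Define perplexity for a language model.", strategy="cot")
```

The abstract's finding is that such cues do not uniformly help: on smaller
models the extra reasoning scaffolding can degrade accuracy.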
Cross-Modality High-Frequency Transformer for MR Image Super-Resolution
Improving the resolution of magnetic resonance (MR) image data is critical to
computer-aided diagnosis and brain function analysis. Higher resolution helps
to capture more detailed content, but typically leads to a lower
signal-to-noise ratio and a longer scanning time. To this end, MR image
super-resolution has become a widely-interested topic in recent times. Existing
works establish extensive deep models with the conventional architectures based
on convolutional neural networks (CNN). In this work, to further advance this
research field, we make an early effort to build a Transformer-based MR image
super-resolution framework, with careful designs on exploring valuable domain
prior knowledge. Specifically, we consider two-fold domain priors including the
high-frequency structure prior and the inter-modality context prior, and
establish a novel Transformer architecture, called Cross-modality
high-frequency Transformer (Cohf-T), to introduce such priors into
super-resolving the low-resolution (LR) MR images. Comprehensive experiments on
two datasets indicate that Cohf-T achieves new state-of-the-art performance.
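As a rough illustration of what a high-frequency structure prior can look like
(an assumption for illustration; Cohf-T's actual prior extraction may differ),
one can subtract a local average from the image and keep the residual:

```python
import numpy as np

def high_frequency_prior(img, k=3):
    """Extract a simple high-frequency component: image minus a k-by-k
    box-blurred version of itself. Edges are handled by replicate padding."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    blur = np.zeros_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            blur[i, j] = padded[i:i + k, j:j + k].mean()
    return img - blur

flat = np.ones((4, 4))       # a constant image has no high-frequency content
hf = high_frequency_prior(flat)
```

A residual like this highlights exactly the fine structures that
super-resolution must recover, which is why it is a natural prior to feed into
the network.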
Masked Collaborative Contrast for Weakly Supervised Semantic Segmentation
This study introduces an efficacious approach, Masked Collaborative Contrast
(MCC), to emphasize semantic regions in weakly supervised semantic
segmentation. MCC adroitly incorporates concepts from masked image modeling and
contrastive learning to devise Transformer blocks that induce keys to contract
towards semantically pertinent regions. Unlike prevalent techniques that
directly eradicate patch regions in the input image when generating masks, we
scrutinize the neighborhood relations of patch tokens by exploring masks
considering keys on the affinity matrix. Moreover, we generate positive and
negative samples in contrastive learning by utilizing the masked local output
and contrasting it with the global output. Elaborate experiments on commonly
employed datasets evidence that the proposed MCC mechanism effectively aligns
global and local perspectives within the image, attaining impressive
performance. The source code is available at
https://github.com/fwu11/MCC.
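The affinity-based mask generation can be sketched as follows; the
"keep the most connected tokens" criterion here is an illustrative assumption,
not the paper's exact rule:

```python
import numpy as np

def affinity_mask(keys, keep_ratio=0.5):
    """Derive a token mask from key affinities rather than by erasing input
    patches directly: tokens whose keys have low average affinity to the rest
    are masked out. keys: (N, d) patch-token keys.
    Returns a boolean mask of shape (N,), True = keep."""
    sim = keys @ keys.T                  # (N, N) affinity matrix
    score = sim.mean(axis=1)             # mean affinity per token
    k = max(1, int(len(keys) * keep_ratio))
    keep = np.argsort(score)[::-1][:k]   # top-k most connected tokens
    mask = np.zeros(len(keys), dtype=bool)
    mask[keep] = True
    return mask

# Three similar tokens and one outlier: the outlier gets masked.
keys = np.array([[1., 0.], [1., 0.], [0., 1.], [1., 0.]])
mask = affinity_mask(keys, keep_ratio=0.75)
```

The masked local output can then be contrasted against the global output to
form the positive/negative pairs mentioned above.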
Integrating UMLS Knowledge into Large Language Models for Medical Question Answering
Large language models (LLMs) have demonstrated powerful text generation
capabilities, bringing unprecedented innovation to the healthcare field. While
LLMs hold immense promise for applications in healthcare, applying them to real
clinical scenarios presents significant challenges, as these models may
generate content that deviates from established medical facts and even exhibit
potential biases. In our research, we develop an augmented LLM framework based
on the Unified Medical Language System (UMLS), aiming to better serve the
healthcare community. We employ LLaMa2-13b-chat and ChatGPT-3.5 as our
benchmark models, and conduct automatic evaluations using the ROUGE Score and
BERTScore on 104 questions from the LiveQA test set. Additionally, we establish
criteria for physician evaluation based on four dimensions: Factuality,
Completeness, Readability, and Relevancy. ChatGPT-3.5 is used for physician
evaluation with 20 questions on the LiveQA test set. Multiple resident
physicians conducted blind reviews to evaluate the generated content, and the
results indicate that this framework effectively enhances the factuality,
completeness, and relevance of the generated content. Our research
demonstrates the effectiveness of UMLS-augmented LLMs and highlights the
potential application value of LLMs in medical question answering.
Comment: 12 pages, 3 figures
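The augmentation idea reduces to retrieving concept knowledge and prepending
it to the prompt before generation. A minimal sketch, where a plain dictionary
stands in for a real UMLS query service:

```python
def umls_augmented_prompt(question, lookup):
    """Build a grounded prompt: find terms from the question in a
    term -> definition mapping (a hypothetical stand-in for UMLS retrieval)
    and prepend the matched definitions as medical context."""
    hits = {t: d for t, d in lookup.items() if t.lower() in question.lower()}
    if not hits:
        return question  # nothing retrieved -> fall back to the raw question
    facts = "\n".join(f"- {t}: {d}" for t, d in hits.items())
    return f"Medical facts from UMLS:\n{facts}\n\nQuestion: {question}"

lookup = {"hypertension": "persistently elevated arterial blood pressure"}
prompt = umls_augmented_prompt("What causes hypertension?", lookup)
```

Grounding the model in retrieved facts this way is what drives the factuality
and completeness gains the physicians observed.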
ViT-Calibrator: Decision Stream Calibration for Vision Transformer
A surge of interest has emerged in utilizing Transformers in diverse vision
tasks owing to their formidable performance. However, existing approaches
primarily focus on optimizing internal model architecture designs that often
entail significant trial and error with high burdens. In this work, we propose
a new paradigm dubbed Decision Stream Calibration that boosts the performance
of general Vision Transformers. To achieve this, we shed light on the
information propagation mechanism in the learning procedure by exploring the
correlation between different tokens and the relevance coefficient of multiple
dimensions. Upon further analysis, we discovered that 1) the final decision is
associated with tokens of foreground targets: token features of the foreground
target are transmitted to the next layer as much as possible, while useless
token features of the background area are gradually eliminated during forward
propagation; and 2) each category is solely associated with specific sparse
dimensions in the tokens. Based on these discoveries, we
designed a two-stage calibration scheme, namely ViT-Calibrator, including token
propagation calibration stage and dimension propagation calibration stage.
Extensive experiments on commonly used datasets show that the proposed approach
can achieve promising results. The source code is given in the supplements.
Comment: 14 pages, 12 figures
Propheter: Prophetic Teacher Guided Long-Tailed Distribution Learning
The problem of deep long-tailed learning, a prevalent challenge in the realm
of generic visual recognition, persists in a multitude of real-world
applications. To tackle the heavily-skewed dataset issue in long-tailed
classification, prior efforts have sought to augment existing deep models with
elaborate class-balancing strategies, such as class rebalancing, data
augmentation, and module improvement. Despite the encouraging performance, the
limited class knowledge of the tailed classes in the training dataset still
bottlenecks the performance of the existing deep models. In this paper, we
propose an innovative long-tailed learning paradigm that breaks the bottleneck
by guiding the learning of deep networks with external prior knowledge. This is
specifically achieved by devising an elaborated ``prophetic'' teacher, termed
as ``Propheter'', that aims to learn the potential class distributions. The
target long-tailed prediction model is then optimized under the instruction of
the well-trained ``Propheter'', such that the distributions of different
classes are as distinguishable as possible from each other. Experiments on
eight long-tailed benchmarks across three architectures demonstrate that the
proposed prophetic paradigm acts as a promising solution to the challenge of
limited class knowledge in long-tailed datasets. Our code and model can be
found in the supplementary material.
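Teacher-guided training of this kind is often realized as a distillation-style
objective. A generic sketch (the loss form, alpha, and temperature are
assumptions, not the paper's exact scheme):

```python
import numpy as np

def softmax(z, t=1.0):
    z = np.asarray(z, float) / t
    e = np.exp(z - z.max())
    return e / e.sum()

def propheter_style_loss(student_logits, teacher_logits, labels_onehot,
                         alpha=0.5, t=2.0):
    """Cross-entropy on the true label plus a KL term pulling the student
    toward the prophetic teacher's class distribution (temperature-softened).
    alpha balances the two terms."""
    p_s, p_t = softmax(student_logits, t), softmax(teacher_logits, t)
    ce = -np.sum(labels_onehot * np.log(softmax(student_logits) + 1e-12))
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))
    return (1 - alpha) * ce + alpha * kl

# A student matching the teacher and label scores lower than a mismatched one.
l_match = propheter_style_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1, 0, 0])
l_off = propheter_style_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], [1, 0, 0])
```

The teacher's learned class distributions act as the external prior knowledge
that compensates for the scarce examples of tailed classes.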
A Survey of Neural Trees
Neural networks (NNs) and decision trees (DTs) are both popular machine
learning models, yet they come with mutually exclusive advantages and
limitations. To bring together the best of the two worlds, a variety of
approaches have been proposed to integrate NNs and DTs, explicitly or
implicitly. In this survey, these approaches are organized into a school that
we term neural trees (NTs).
This survey aims to present a comprehensive review of NTs and attempts to
identify how they enhance the model interpretability. We first propose a
thorough taxonomy of NTs that expresses the gradual integration and
co-evolution of NNs and DTs. Afterward, we analyze NTs in terms of their
interpretability and performance, and suggest possible solutions to the
remaining challenges. Finally, this survey concludes with a discussion about
other considerations like conditional computation and promising directions
towards this field. A list of papers reviewed in this survey, along with their
corresponding codes, is available at:
https://github.com/zju-vipa/awesome-neural-trees
Comment: 35 pages, 7 figures, and 1 table
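One canonical NN/DT hybrid in this school is the soft decision tree, where
each internal node routes an input probabilistically through a sigmoid gate
and the prediction mixes the leaves by path probability. A minimal depth-2
sketch with hypothetical parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_tree_predict(x, nodes, leaves):
    """Soft routing in a depth-2 binary tree.
    nodes:  [(w, b), ...] gating params for root, left child, right child.
    leaves: 4 leaf values, left to right.
    Returns the probability-weighted mix of leaf values."""
    def gate(i):
        w, b = nodes[i]
        return sigmoid(w * x + b)   # probability of taking the left branch
    p_root, p_l, p_r = gate(0), gate(1), gate(2)
    # Path probabilities to the four leaves; they always sum to 1.
    probs = [p_root * p_l, p_root * (1 - p_l),
             (1 - p_root) * p_r, (1 - p_root) * (1 - p_r)]
    return sum(p * v for p, v in zip(probs, leaves))

# With neutral gates (all 0.5), the prediction is the mean of the leaves.
pred = soft_tree_predict(0.0, [(1.0, 0.0)] * 3, [0.0, 1.0, 2.0, 3.0])
```

The soft gates make the whole tree differentiable, which is precisely what
lets NN training machinery and DT structure co-evolve, while the explicit
routing paths are what the survey credits for improved interpretability.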